12-in-1: Facebook AI's New Framework Tackles Multiple Vision-and-Language Tasks

In recent years, researchers in the deep learning, computer vision and natural language processing communities have become increasingly interested in vision and language (V&L). A compelling reason to study language and vision jointly is the promise of language as a universal and natural interface for visual reasoning problems, useful both for specifying a wide range of problems and for communicating AI responses. However, previous research in visually grounded language understanding has been mostly task-specific. Researchers from Facebook AI Research, Georgia Institute of Technology, and Oregon State University found that the skills required for different V&L tasks, such as visual question answering and caption-based image retrieval, overlap significantly, thanks mainly to the rise of general V&L architectures. The wide variety of independent V&L tasks motivated these researchers to explore ways to consolidate some of them, and the result of their efforts is an all-in-one model that learns from 12 supporting datasets across four broad categories of V&L tasks.